Add OCI example tests for OFED userspace tools and iperf3 #477

Merged: 1 commit into canonical:main on Mar 31, 2025

Conversation

trentindav
Contributor

Description

In the context of testing OFED on OCI instances, I prepared a list of CLI tools that must be installed by the package.
I was advised to add tests to examples/oracle/oracle-example-cluster-test.py that validate these commands are installed. Note that these tests only check that no error is returned and that stdout includes a few expected keywords.
The CLI tools covered by the tests are:

  • mst
  • mlxconfig
  • mlxfwmanager
  • flint
  • mlxfwreset
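
A minimal sketch of the kind of check described above (no error returned, expected keywords present in stdout). It runs the command locally with subprocess as a stand-in, whereas the tests in this PR run the tools on the OCI instances. The helper names and any keywords passed in are hypothetical, not taken from the PR.

```python
import subprocess
from typing import Iterable, List


def missing_keywords(stdout: str, keywords: Iterable[str]) -> List[str]:
    """Return the expected keywords that do not appear in stdout."""
    return [kw for kw in keywords if kw not in stdout]


def check_cli_tool(command: List[str], keywords: Iterable[str]) -> None:
    """Run a command locally and fail on a non-zero exit code or missing keywords.

    Assumption: local execution stands in for running the tool on a remote instance.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    assert result.returncode == 0, f"{command} failed: {result.stderr}"
    missing = missing_keywords(result.stdout, keywords)
    assert not missing, f"{command} output lacks expected keywords: {missing}"
```

A check like this only shows that the tool is installed and produces recognizable output; it does not validate firmware state or device configuration.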

I also added a test that checks the throughput measured by iperf3; the expected throughput leaves some room below the 50 Gbps supported by the physical interface.
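
One way such a throughput check could work is to parse the receiver summary line from the iperf3 output and compare it against a minimum. This is a sketch under assumptions: the function names and the threshold value passed by the caller are hypothetical, and the parser only handles summaries reported in Gbits/sec.

```python
import re
from typing import Optional

# Matches the final "[SUM] ... <rate> Gbits/sec ... receiver" line of iperf3 output.
SUM_RECEIVER = re.compile(
    r"^\[SUM\].*?([\d.]+)\s+Gbits/sec\s+receiver\s*$", re.MULTILINE
)


def parse_receiver_gbps(iperf3_output: str) -> Optional[float]:
    """Return the receiver-side aggregate throughput in Gbits/sec, or None."""
    match = SUM_RECEIVER.search(iperf3_output)
    return float(match.group(1)) if match else None


def meets_threshold(iperf3_output: str, min_gbps: float) -> bool:
    """True if a receiver summary was found and is at least min_gbps."""
    measured = parse_receiver_gbps(iperf3_output)
    return measured is not None and measured >= min_gbps
```

Treating a missing summary line as a failure keeps the check from passing silently when iperf3 prints an error instead of results.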

Additional Context and Relevant Issues

The original manual testing of the OFED packages comes from ATLA-29

Test Steps

I manually executed the newly added tests on running instances where OFED and iperf3 were already installed. All tests pass.

$ tox -e integration-tests -- examples/oracle/oracle-example-cluster-test.py -k "TestOracleClusterOfedTools"
$ tox -e integration-tests -- examples/oracle/oracle-example-cluster-test.py -k "TestOracleClusterPerformance"
$ tox -e format

Contributor

@MitchellAugustin MitchellAugustin left a comment

The iperf3 tests are not passing for me on an image with iperf3 installed. Aside from that, everything else works as expected and looks good.

If the iperf tests work on someone else's known, correct configuration, please record that here.

@trentindav
Contributor Author

I am not able to reproduce the test_iperf3 failure. I spawned a new cluster network and executed these steps on both hosts before running the test:

  1. Add a second VNIC
  2. Install MOFED DKMS and userspace PPA
  3. sudo apt install -y iperf3 rdmacm-utils ucx-utils
  4. Reboot
  5. Run the test, adding the two instances' OCIDs to EXISTING_INSTANCE_IDS
tox -e integration-tests -- examples/oracle/oracle-example-cluster-test.py -k "TestOracleClusterPerformance"
integration-tests: commands[0]> .tox/integration-tests/bin/python -m pytest --log-cli-level=INFO -svv examples/oracle/oracle-example-cluster-test.py -k TestOracleClusterPerformance
========================================================================================================================= test session starts ==========================================================================================================================
platform linux -- Python 3.12.3, pytest-8.3.4, pluggy-1.5.0 -- /home/davide/repo/tools/pycloudlib/.tox/integration-tests/bin/python
cachedir: .tox/integration-tests/.pytest_cache
rootdir: /home/davide/repo/tools/pycloudlib
configfile: pyproject.toml
plugins: xdist-3.6.1, mock-3.14.0, cov-6.0.0
collecting ... 
------------------------------------------------------------------------------------------------------------------------- live log collection --------------------------------------------------------------------------------------------------------------------------
INFO     oci.circuit_breaker:__init__.py:27 Default Auth client Circuit breaker strategy enabled
collected 12 items / 11 deselected / 1 selected                                                                                                                                                                                                                        

examples/oracle/oracle-example-cluster-test.py::TestOracleClusterPerformance::test_iperf3 
---------------------------------------------------------------------------------------------------------------------------- live log setup ----------------------------------------------------------------------------------------------------------------------------
INFO     pycloudlib.cloud.OCI:cloud.py:354 No public key path provided, using: /home/davide/.ssh/id_ed25519.pub
INFO     oracle-example-cluster-test:oracle-example-cluster-test.py:143 Instance ocid1.instance.oc1.phx.anyhqljsniwq6syc5nsrp6gpjsdamswpjs6aluliyypm2v5k25h3ackuc3ia already has a secondary VNIC, not attaching one.
INFO     oracle-example-cluster-test:oracle-example-cluster-test.py:143 Instance ocid1.instance.oc1.phx.anyhqljsniwq6sycbiiymnuolvlp3hm7iryh5xr5rwzqh4vzz74gpkp43tea already has a secondary VNIC, not attaching one.
---------------------------------------------------------------------------------------------------------------------------- live log call -----------------------------------------------------------------------------------------------------------------------------
INFO     pycloudlib.instance:instance.py:285 executing: sh -c 'iperf3 -s -1'
INFO     pycloudlib.instance:instance.py:106 Using ipv4 address: 129.146.167.132
INFO     paramiko.transport:transport.py:1944 Connected (version 2.0, client OpenSSH_9.6p1)
INFO     paramiko.transport:transport.py:1944 Authentication (publickey) successful!
INFO     pycloudlib.instance:instance.py:285 executing: sh -c 'iperf3 -c 10.0.1.99 -P 40 -Z | grep SUM'
INFO     pycloudlib.instance:instance.py:106 Using ipv4 address: 129.146.4.55
INFO     paramiko.transport:transport.py:1944 Connected (version 2.0, client OpenSSH_9.6p1)
INFO     paramiko.transport:transport.py:1944 Authentication (publickey) successful!
INFO     oracle-example-cluster-test:oracle-example-cluster-test.py:450 iperf3 output: [SUM]   0.00-1.00   sec  5.39 GBytes  46.3 Gbits/sec  3326             
[SUM]   1.00-2.00   sec  5.36 GBytes  46.0 Gbits/sec  2993             
[SUM]   2.00-3.00   sec  5.36 GBytes  46.0 Gbits/sec  2759             
[SUM]   3.00-4.00   sec  5.35 GBytes  46.0 Gbits/sec  2745             
[SUM]   4.00-5.00   sec  5.36 GBytes  46.0 Gbits/sec  2633             
[SUM]   5.00-6.00   sec  5.35 GBytes  46.0 Gbits/sec  2931             
[SUM]   6.00-7.00   sec  5.36 GBytes  46.0 Gbits/sec  2641             
[SUM]   7.00-8.00   sec  5.35 GBytes  46.0 Gbits/sec  2662             
[SUM]   8.00-9.00   sec  5.36 GBytes  46.0 Gbits/sec  2883             
[SUM]   9.00-10.01  sec  5.36 GBytes  45.8 Gbits/sec  2857             
[SUM]   0.00-10.01  sec  53.6 GBytes  46.0 Gbits/sec  28430             sender
[SUM]   0.00-10.01  sec  53.6 GBytes  46.0 Gbits/sec                  receiver
iperf3 measured throughput: 46.0
PASSED

=========================================================================================================================== warnings summary ===========================================================================================================================
examples/oracle/oracle-example-cluster-test.py: 132 warnings
  /home/davide/repo/tools/pycloudlib/.tox/integration-tests/lib/python3.12/site-packages/oci/base_client.py:77: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
    return " " + str(datetime.utcnow()) + ": "

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================================================================================================== 1 passed, 11 deselected, 132 warnings in 23.71s ============================================================================================================
  integration-tests: OK (24.44=setup[0.04]+cmd[24.41] seconds)
  congratulations :) (24.50 seconds)

Contributor

@MitchellAugustin MitchellAugustin left a comment

Davide has verified that the iperf3 tests work on his image, and I have verified that all other tests work as expected and look good to me, so I think this should be good to merge.

@a-dubs
Contributor

a-dubs commented Mar 3, 2025

@MitchellAugustin thank you for reviewing this!
And thank you @trentindav for putting this PR up. love to see that the partner engineering team is taking to pycloudlib 🙌
I will review this later today and have some other CPC folks give this a look.

@trentindav trentindav force-pushed the oci-mlxtools-tests branch from d9f103a to 2472cd1 Compare March 3, 2025 14:53
@a-dubs a-dubs requested a review from Copilot March 3, 2025 15:19
@trentindav trentindav force-pushed the oci-mlxtools-tests branch from 2472cd1 to b8c7175 Compare March 3, 2025 15:23

@Copilot Copilot AI left a comment

PR Overview

This PR adds OCI example tests for OFED userspace tools and iperf3 while refactoring RDMA test logic.

  • Introduces a new helper function (ensure_second_vnics_ready) to verify secondary VNICs are present and refactors the RDMA test fixture.
  • Adds tests to validate the installation and basic output checks for mst, mlxconfig, mlxfwmanager, flint, mlxfwreset, and iperf3 performance.

Reviewed Changes

File | Description
examples/oracle/oracle-example-cluster-test.py | Refactored RDMA tests; added new tests for OFED CLI tools and iperf3

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

examples/oracle/oracle-example-cluster-test.py:129

  • Typo detected in the skip message, 'beiing' should be corrected to 'being'.
pytest.skip("The image beiing used is not RDMA ready")

@trentindav trentindav force-pushed the oci-mlxtools-tests branch 2 times, most recently from dfbc975 to 3afdefc Compare March 6, 2025 19:51
Contributor

@a-dubs a-dubs left a comment

changes look great. thanks for working with me to improve some of the various docstrings and organization of the code. 💙

Contributor

@uhryniuk uhryniuk left a comment

Sorry for delayed comments, finishing my review now!

@trentindav
Contributor Author

@MitchellAugustin @a-dubs while validating the latest version of the tests I noticed that I cannot reach the same iperf3 performance I consistently measured when I first implemented them (I get ~30 Gbps compared to the ~45 Gbps previously measured). I am using the same custom image as in the previous tests, so the difference may come from a change in HW or network load.
I am reducing the accepted threshold to let TestOracleClusterPerformance::test_iperf3 pass, unless you believe this is an issue worth investigating.

@trentindav trentindav force-pushed the oci-mlxtools-tests branch 2 times, most recently from c01a13d to 529269b Compare March 14, 2025 13:15
@MitchellAugustin
Contributor

@trentindav

I am reducing the accepted threshold to let TestOracleClusterPerformance::test_iperf3 pass, unless you believe this is an issue worth investigating

I think it is OK to lower the accepted threshold here since we aren't tailoring this test to reach line rate for any specific device.

In general though, to reach line rate via iperf3 at speeds above 40Gbps, you may need to use multiple iperf3 streams/processes. This is something that we need to do when testing the DGXes at 100Gbps+, since individual iperf3 threads aren't necessarily capable of rates that high.

@trentindav
Contributor Author

@MitchellAugustin I am currently using -P to start 40 parallel streams in the iperf3 client, and this was sufficient to reach 45 Gbps. I am also using the -Z (zerocopy) option, which seemed to help get more consistent results.
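
For reference, a small sketch of how the server and client command lines seen in the test log (`iperf3 -s -1` and `iperf3 -c <ip> -P 40 -Z`) could be assembled. The function names and defaults are hypothetical; only the iperf3 flags themselves come from the log above.

```python
import shlex


def iperf3_server_command() -> str:
    """Server side: -s runs as a server, -1 serves one client and then exits."""
    return "iperf3 -s -1"


def iperf3_client_command(server_ip: str, streams: int = 40, zerocopy: bool = True) -> str:
    """Client side: -P sets the number of parallel streams, -Z enables zerocopy."""
    if streams < 1:
        raise ValueError("streams must be at least 1")
    args = ["iperf3", "-c", server_ip, "-P", str(streams)]
    if zerocopy:
        args.append("-Z")
    # Quote each argument so the string is safe to pass to a shell.
    return " ".join(shlex.quote(arg) for arg in args)
```

With the server address from the log, `iperf3_client_command("10.0.1.99")` yields the same client invocation the test ran, minus the `| grep SUM` filter.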

@MitchellAugustin
Contributor

@MitchellAugustin I am currently using -P to start 40 parallel streams in the iperf3 client, and this was sufficient to reach 45 Gbps. I am also using the -Z (zerocopy) option, which seemed to help get more consistent results.

ah sorry, somehow I missed that

Add test cases to the Oracle cluster example, to validate the presence
of Nvidia firmware CLI tools and to confirm that the throughput measured
by iperf3 is acceptable.
Contributor

@uhryniuk uhryniuk left a comment

lgtm

@a-dubs a-dubs merged commit 751e920 into canonical:main Mar 31, 2025
8 checks passed
4 participants